The goal of this hands-on practical session is to perform a mutational signature analysis of a set of cancer patients and decipher the biological processes underlying the mutations found in the patients’ tumors, and ultimately leading to the development and progression of the disease.
While the analysis of the somatic mutations found in a tumor is commonly done looking for potential driver alterations (defined as those leading to cancer development, and previously reviewed in the course), in the last decade, mutational signature analysis has revolutionized this paradigm. By exploring and characterizing the patterns of all somatic mutations (also including passenger mutations), it is possible to identify the causes leading to a specific tumor, including different sources like environmental exposures (UV light exposure) or lifestyle choices (tobacco smoking), as well as endogenous cellular mechanisms (like a deficient DNA repair machinery). Somatic mutations are present in all cells of the human body and occur throughout life. They are the consequence of multiple mutational processes, which generate unique combinations of mutation types, termed mutational signatures.
To perform mutational signature analysis, we will be using the SigProfiler suite of tools. SigProfiler provides a comprehensive and integrated suite of bioinformatic tools for performing mutational signature analysis. The software covers the analytical lifecycle starting with the generation of the mutational matrices (SigProfilerMatrixGenerator) and finishing with signature extraction (SigProfilerExtractor) and assignment (SigProfilerAssignment), as well as supporting functionality for plotting and simulation.
Our input data will be cancer sequencing data from the publicly
available cBioPortal platform.
In particular, we will analyze data from The
Cancer Genome Atlas (TCGA) project. To focus on this particular
project and, specifically, on the latest published results, you should
search for “tcga pancancer atlas” in the cBioPortal Query
tab, as shown below.
We will perform this exercise using data from colorectal adenocarcinoma patients, and jointly discuss the results together at the end of the session. The goal is to identify the shared patterns of mutations for this specific cancer type, as well as the most common mutational signatures active.
Side-note: For the upcoming group projects session, we will also be using data from the cBioPortal platform, so make sure to spend some time getting familiar with the platform, as well as exploring all the metadata available for the different studies and cancer types.
After selecting the data set of choice, we will choose the
Explore selected studies button. This will lead us to the
Summary panel, where different clinical features can be
explored.
Please explore the clinical and molecular data categories present in cBioPortal and answer the questions below.
Side-note: Please provide your answers in the red rectangles after each question. Correct answers will be indicated by a change of rectangle color.
Note: cancer samples are not the same as cancer patients, as more than one sample can come from the same patient, e.g., because the patient had multiple tumors or there are various biopsies from the same tumor (at different time points or locations).
The data to use for the exercise has already been downloaded and it’s
present in the virtual machine. In case you want to download any cancer
sequencing data from cBioPortal in the future for your own analysis, you
should use the following button within the Summary
panel.
Side-note: The code presented below was built to run the analysis on the colorectal adenocarcinomas cohort. In case you want to perform mutational analysis in other cohorts or for your own projects, you will have to adapt it for your specific analysis.
This will download all available data for the cancer data set,
including clinical and molecular data. In our case, the file of interest
is the one containing the mutation data, which is named
data_mutations.txt. Although the extension of this file is
.txt, it was actually generated using the MAF format,
commonly seen in cancer sequencing data and previously reviewed in the
course.
Unfortunately, the MAF format used by cBioPortal is slightly
different from the one developed by the National
Cancer Institute, which is the default supported by the SigProfiler
suite of tools. Considering this, we will reformat the file using the
tidyverse collection of packages as follows:
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Read cBioPortal style MAF file
maf_cbioportal = read.delim('coadread_tcga_pan_can_atlas_2018/data_mutations.txt')
# Selection of specific columns needed by SigProfiler
maf_sp = maf_cbioportal %>%
select(Hugo_Symbol, Entrez_Gene_Id, Center, NCBI_Build, Chromosome,
Start_Position, End_Position, Strand, Variant_Classification,
Variant_Type, Reference_Allele, Tumor_Seq_Allele1,
Tumor_Seq_Allele2, dbSNP_RS, dbSNP_Val_Status, Tumor_Sample_Barcode)
# Filter for only considering single base substitutions
maf_sp = maf_sp %>%
filter(Variant_Type == 'SNP')We will also create a specific folder for all our signature analysis results, as well as a specific folder inside for the updated MAF file, as this is also required by the SigProfiler tools.
# Create new folder for signature analysis results
dir.create('signatures')
# Create new folder for updated MAF file (needed for SigProfilerMatrixGenerator)
dir.create('signatures/SPMG/')
# Write updated MAF file
write.table(maf_sp, 'signatures/SPMG/data_mutations.maf', quote = F,
row.names = F, sep = '\t')Benefiting from the standard MAF format file we have generated, we can now use the SigProfilerMatrixGenerator package to manage this sequencing data. This package allows us to create a matrix classifying the mutations in the MAF file into the subtypes of interest. Mutational matrices are the first step for mutational signature analysis and correspond to a helpful data type, as they contain no protected information.
In this case, we are going to focus on the SBS96 mutational context, which, as mentioned in the lecture, allows classifying single base substitutions in different categories based on the mutated nucleotide, as well as the immediately preceding and posterior nucleotides to the mutation (a.k.a. the 5’ and 3’ nucleotides from the mutation). Other contexts exist for single base substitutions, as well as additional variant types, such as doublet base substitutions (DBSs) or short insertions and deletions (indels; IDs).
Although SigProfilerMatrixGenerator and the other SigProfiler tools that we will be using today have been developed in Python, there are R wrappers available. These make it easy to switch between the two different platforms. In the context of this course, we will use the R version of all SigProfiler tools (note the final R added to the name of the packages). More information can be found on the associated GitHub pages for the python and R versions of SigProfilerMatrixGenerator, as well as in the corresponding wiki page.
Side-note: Although the SigProfiler suite of tools has already been pre-installed in the virtual machine for this course, if you are interested in applying mutational signature analysis in your own projects, please take a look at the optional section at the end of this document for installation instructions.
# Load reticulate library (for using python packages in R)
library(reticulate)
# Fixing conda environment (check optional section for details)
use_condaenv('mutational_signatures')
# Load R wrapper package for SigProfilerMatrixGenerator
library(SigProfilerMatrixGeneratorR)The first step to run SigProfilerMatrixGenerator is installing a reference genome, that should match the one used for the alignment of the next generation sequencing data. We have already preinstalled human reference genomes GRCh37 and GRCh38 in the virtual machines, but in case you need to install these genomes (or different ones) on a different computer you can follow the code below:
# Install reference genome (only required once, previously done in the VM)
install('GRCh37', rsync=FALSE, bash=TRUE)In order to run SigProfilerMatrixGenerator for the colorectal adenocarcinoma tumors from the TCGA project available in cBioPortal you can use the following:
# Generate mutational profiles analysis using SigProfilerMatrixGenerator
matrices <- SigProfilerMatrixGeneratorR(project = "COREAD",
genome = "GRCh37",
matrix_path = "./signatures/SPMG",
plot = F,
exome = T)A successful run will indicate that mutational matrices have been
generated for a total of 528 samples and 305,525 single base
substitutions. Please review the output files generated by
SigProfilerMatrixGenerator by exploring the folder structure created
inside the SPMG directory.
For the visualization of SBS96 mutational profiles, we will make use of the SigProfilerPlotting tool. To generate mutational profile plots, we will use the previously generated mutational matrices as input.
Side-note: Although SigProfilerMatrixGenertor also allows users to directly generate mutational profile plots, this is quite time consuming, as plots are generated for every different SBS classification (as well as for every sample).
library(SigProfilerPlottingR)
plotSBS(matrix_path = 'signatures/SPMG/output/SBS/COREAD.SBS96.exome',
output_path = 'signatures/SPMG/output/SBS/',
project = 'COREAD',
plot_type = '96',
percentage = FALSE)The file containing the mutational profiles for all samples will be
located in the directory specified in the output_path
parameter. The mutational profiles for the first three samples of the
cohort are provided as example below:
To get an idea of the overall patterns of mutations in the whole cohort, it is useful to generate the average mutational profile of all samples. To do this, it’s important to keep in mind that we need to first obtain the relative mutational matrix, using percentages instead of absolute values. This step is required to avoid samples with high numbers of mutations to bias the average mutational profile.
Side-note: You can try to see what happens if you generate the average mutational profile directly with the absolute values instead of the relative ones. Can you observe any significant difference? Why do you think this happens?
# Generate average mutational profiles
mut_matrix = matrices[['96']]
# Get relative mutational matrix
relative_mut_matrix = apply(mut_matrix, 2, prop.table)
# Get average mutational matrix
average_mut_matrix = rowMeans(relative_mut_matrix)
average_mut_matrix = data.frame(Average_COREAD = average_mut_matrix)
# Add row names as column and print
average_mut_matrix_to_print = cbind(rownames(average_mut_matrix),
average_mut_matrix)
colnames(average_mut_matrix_to_print)[1] = 'MutationType'
write.table(average_mut_matrix_to_print, 'signatures/avg_COREAD.SBS96.all',
quote = F, row.names = F, sep = '\t')
# Plot average mutational profiles (note the percentage parameter now)
plotSBS(matrix_path = 'signatures/avg_COREAD.SBS96.all',
output_path = 'signatures/',
project = 'avg_COREAD',
plot_type = '96',
percentage = TRUE)An interesting analysis when working with large cohorts of cancer cases is to identify differences according to specific subgroups based on clinical characteristics. In the case of colorectal cancer, there is several molecular subgroups that have been previously identified to have significant differences in the number and pattern of mutations (Muzny et al. 2012 Nature).
This have been annotated for the cBioPortal data using the
Subtype column in the clinical metadata file provided.
However, to start working with this metadata, we need to first
filter it to use only the samples where we have mutation
information.
# Read clinical file with metadata
metadata = read.delim('coadread_tcga_pan_can_atlas_2018/coadread_tcga_pan_can_atlas_2018_clinical_data.tsv')
# Filtering metadata file to use only samples where we have mutation information
metadata = metadata %>%
filter(Sample.ID %in% maf_sp$Tumor_Sample_Barcode)We are going to consider the main four subgroups for colon cancer,
annotated as COAD_CIN, COAD_GS,
COAD_MSI, and COAD_POLE. To obtain the average
mutational profile for the COAD_CIN subgroup you can use
the following code:
# Get samples from group
samples_group = metadata %>%
filter(Subtype == 'COAD_CIN') %>%
pull(Sample.ID)
# Select group samples from main matrix and get average
mm_group = rowMeans(relative_mut_matrix[,samples_group])
mm_group = data.frame(mm_group)
# Add row names as column and print
mm_group_to_print = cbind(rownames(mm_group),mm_group)
colnames(mm_group_to_print) = c('MutationType', 'COAD_CIN')
write.table(mm_group_to_print,
'signatures/avg_COAD_CIN.SBS96.all',
quote = F, row.names = F, sep = '\t')
# Plot average mutational profiles (note the percentage parameter now)
plotSBS(matrix_path = 'signatures/avg_COAD_CIN.SBS96.all',
output_path = 'signatures/',
project = 'avg_COAD_CIN.SBS96.all',
plot_type = '96',
percentage=TRUE)Please explore the provided metadata from cBioPortal and answer the questions below.
COAD_GS, COAD_MSI, and
COAD_POLE subgroups, and compare them agains the one
previously done for COAD_CIN.These are the expected average mutational profiles for the two
indicated subgroups. Do you observe any differences with the one shown
above for the COAD_CIN subgroup?